Using Twitter to Collect a Multi-Dialectal Corpus of Arabic
Abstract
This paper describes the collection and classification of a multi-dialectal corpus of Arabic based on the geographical information of tweets. We mapped user location information to one of the Arab countries and extracted tweets that contain dialectal word(s). Manual evaluation of the extracted corpus shows that the accuracy of assigning tweets to some countries (such as Saudi Arabia and Egypt) is above 93%, while the accuracy for other countries, such as Algeria and Syria, is below 70%.
Similar resources
A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic
This paper presents a multi-dialect, multi-genre, human annotated corpus of dialectal Arabic with data obtained from both online newspaper commentary and Twitter. Most Arabic corpora are small and focus on Modern Standard Arabic (MSA). There has been recent interest, however, in the construction of dialectal Arabic corpora (Zaidan and Callison-Burch, 2011a; Al-Sabbagh and Girju, 2012). This wor...
Arabizi Identification in Twitter Data
In this work we explore some challenges related to analysing one form of the Arabic language called Arabizi. Arabizi, a portmanteau of Araby-Englizi, meaning Arabic-English, is a digital trend in texting Non-Standard Arabic using Latin script. Arabizi users express their natural dialectal Arabic in text without following a unified orthography. We address the challenge of identifying Arabizi fro...
YouDACC: the Youtube Dialectal Arabic Comment Corpus
In the Arab world, while Modern Standard Arabic is commonly used in formal written context, on sites like Youtube, people are increasingly using Dialectal Arabic, the language for everyday use to comment on a video and interact with the community. These user-contributed comments along with the video and user attributes, offer a rich source of multi-dialectal Arabic sentences and expressions fro...
Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments
Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There are many existing researches in the field of Arabic language Sentiment Analysis (SA); however, they are generally restricted to Modern Standard Arabic (MSA) or some dialects of economic or political inte...
YouDACC: the Youtube Dialectal Arabic Commentary Corpus
In the Arab world, while Modern Standard Arabic is commonly used in formal written context, on sites like Youtube, people are increasingly using Dialectal Arabic, the language for everyday use to comment on a video and interact with the community. These user-contributed comments along with the video and user attributes, offer a rich source of multi-dialectal Arabic sentences and expressions fro...